MLDB provides a complete implementation of the SQL SELECT statement. Most of the functions you are accustomed to using are available in your queries.
MLDB is different from traditional SQL databases in that there is no enforced schema on rows, allowing you to work with millions of columns of sparse data. This makes it easy to load and manipulate sparse datasets, even when there are millions of columns. To reduce the size of your dataset or use only specific variables, we may need to select columns based on specific critera. Column Expressions is an MLDB extension that provides additional control over your column selection. With a column expression, you can programmatically return specific columns with a SQL SELECT
statement.
In this tutorial, we will provide examples of COLUMN EXPR
within SELECT
statements. This tutorial assumes familiarity with Procedures and Datasets. We suggest going through the Procedures and Functions Tutorial and the Loading Data Tutorial beforehand.
The notebook cells below use pymldb
's Connection
class to make REST API calls. You can check out the Using pymldb
Tutorial for more details.
In [12]:
from pymldb import Connection
mldb = Connection()
Let's begin by loading and visualizing our data. We will be using the dataset from the Virtual Manipulation of Datasets Tutorial. We had chosen the tokenize
function to count the number of words in the Wikipedia descriptions of several Machine Learning concepts (please check out the tutorial for more details).
In [13]:
print mldb.put("/v1/procedures/import_ML_concepts", {
"type":"import.text",
"params": {
"dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
"outputDataset":{
"id":"ml_concepts",
"type": "sparse.mutable"
},
"named": "Concepts",
"select": """
tokenize(
lower(Text),
{splitChars: ' -''"?!;:/[]*,().',
minTokenLength: 4}) AS *
""",
"runOnCreation": True
}
})
Each word is represented by a column and each Machine Learning concept by a row. We can run a simple SELECT
query to take a quick look at the first 5 rows of our dataset.
In [14]:
mldb.query("SELECT * FROM ml_concepts LIMIT 5")
Out[14]:
There are 286 columns, some of which may or may not be useful to the data analysis at hand. For example, we may want to rebuild a dataset with:
This can be done in a few queries as you will see below.
Column Expressions provide efficient ways of picking and choosing our columns. For example, we can only choose verbs and adverbs that end with "ing" to understand the overall meaning of a description.
We use the columnName
column expression function along with the LIKE
SQL expression, as you will see below.
In [15]:
mldb.query("""
SELECT COLUMN EXPR (WHERE columnName() LIKE '%ing')
FROM ml_concepts
LIMIT 5
""")
Out[15]:
This is very powerful because the LIKE
statement in Standard SQL is typically found in row operations and more rarely in column operations. MLDB makes it simple to use such SQL expressions on columns.
With Column Expressions, we can select columns based on specific row selection criteria. COLUMN EXPR</code> will allow us for example to choose words that appear in multiple descriptions. In this case, we filter on words that show up at least 4 times.
To achieve the desired outcome, we use a Built-in Function available in column expressions called rowCount
. rowCount
iterates through each column and returns the number of rows that have a value for the specific column.
In [16]:
mldb.query("""
SELECT COLUMN EXPR (WHERE rowCount() > 4)
FROM ml_concepts
""")
Out[16]:
The results make sense. The words that we found above in the columns are common in Machine Learning concept descriptions. With a plain SQL statement and the rowCount
function, we reduced our dataset to include words that appear at least 4 times.
Nested JSON objects can have complex schemas, often involving multi-level and multidimensional data structures. In this section we will create a more complex dataset to illustrate ways to simplify data structures and column selection with Built-in Function and Column Expression.
Let's first create an empty dataset called 'toy_example'.
In [17]:
# create dataset
print mldb.put('/v1/datasets/toy_example', { "type":"sparse.mutable" })
We will now create one row in the 'toy_example' dataset with the 'row1' JSON object below.
In [18]:
import json
row1 = {
"name": "Bob",
"address": {"city": "Montreal", "street": "Stanley"},
"sports": ["soccer","hockey"],
"friends": [{"name": "Mich", "age": 25}, {"name": "Jean", "age": 28}]
}
# update dataset by adding a row
mldb.post('/v1/datasets/toy_example/rows', {
"rowName": "row1",
"columns": [["data", json.dumps(row1), 0]]
})
# save changes
mldb.post("/v1/datasets/toy_example/commit")
Out[18]:
We will check out our data with a SELECT
query.
In [19]:
mldb.query("SELECT * FROM toy_example")
Out[19]:
There are many elements within the cell above. We will need to better structure elements within the nested JSON object.
To understand and query nested JSON objects, we will be using a Built-in Function called parse_json</code> and a Column Expression
columnPathElement</code>.
This is where the parse_json
function comes in handy. It will help us turn a multidimensional JSON object into a 2D dataset.
In [20]:
mldb.query("""
SELECT parse_json(data, {arrays: 'parse'}) AS *
FROM toy_example
""")
Out[20]:
parse_json
is a powerful feature since we can create 2D representations out of multidimensional data. We can read all of the elements of the JSON object on one line. It is also easier to SQL as we will see below.
columnPathElement
makes it convenient to navigate specific parts of the data structure. In the next block of code, we will do the following:
parse_json
to parse each data element of the object on one row (same as above)columnPathElement
where the the column path name at index = 2 is 'name' (note that 'friends' is at index = 0)
In [21]:
mldb.query("""
SELECT COLUMN EXPR (WHERE columnPathElement(2) = 'name')
FROM (
SELECT parse_json(data, {arrays: 'parse'}) AS * NAMED rowPath() FROM toy_example
)
""")
Out[21]:
We now know the name of Bob's two friends... As you may have noticed, this is very practical if we want to query a specific attribute of a nested object. The columnPathElement
Column Expression allows us to easily query specific JSON data levels or dimensions.
Column operations such as the ones shown in this tutorial can be difficult without column expressions. Column Expressions offer a compact and flexible way to programmatically select columns. It is a great tool to carve out the data that is most needed for your analysis.
In this tutorial, we covered three Column Expressions:
columnName
which returns the name of the columns inside our datasetrowCount
which returns the number of non-empty rows for each columncolumnPathElement
which allows us to chose columns at specific sub-levelsCheck out the other Tutorials and Demos.